Cleaning the GenBank Arabidopsis thaliana data set.
Abstract
Data-driven computational biology relies on the large quantities of genomic data stored in international sequence data banks. However, the possibilities are drastically impaired if the stored data is unreliable. During a project aiming to predict splice sites in the dicot Arabidopsis thaliana, we extracted a data set from the A.thaliana entries in GenBank. A number of simple 'sanity' checks, based on the nature of the data, revealed an alarmingly high error rate. More than 15% of the most important entries extracted contained erroneous information. In addition, a number of entries had directly conflicting assignments of exons and introns that did not stem from alternative splicing. In a few cases the errors are mere typographical misprints, which may be corrected by comparison with the original papers, but errors caused by wrong assignments of splice sites from experimental data are the most common. It is proposed that the level of error correction should be increased and that gene structure sanity checks should be incorporated, also at the submitter level, to avoid or reduce the problem in the future. A non-redundant and error-corrected subset of the data for A.thaliana is made available through anonymous FTP.
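The abstract does not list the individual sanity checks, so the following Python sketch is only an illustration of what gene structure checks of this kind might look like. It assumes three commonly used rules: exons must not overlap, introns are expected to begin with GT and end with AG, and the spliced coding sequence should start with ATG, end with a stop codon, and have a length divisible by three. The function name `check_gene_structure` and the input layout are hypothetical, not taken from the paper.

```python
# Illustrative sketch only: the rules below are assumed, not the paper's own list.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_gene_structure(sequence, exons):
    """Return a list of problems found for one annotated gene.

    sequence: genomic DNA string on the coding strand.
    exons:    list of (start, end) pairs, 1-based and inclusive,
              covering the coding region from start codon to stop codon.
    """
    seq = sequence.upper()
    exons = sorted(exons)
    problems = []

    # Coordinates must lie inside the sequence.
    for start, end in exons:
        if start < 1 or end > len(seq) or start > end:
            problems.append(f"exon {start}..{end} is out of range")

    # Neighbouring exons must leave room for an intron between them.
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        if next_start <= prev_end + 1:
            problems.append(f"exons overlap or touch at {prev_end}/{next_start}")

    if problems:
        return problems  # coordinates are unusable, skip sequence checks

    # Assumed rule: introns start with GT and end with AG.
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        intron = seq[prev_end:next_start - 1]
        label = f"intron {prev_end + 1}..{next_start - 1}"
        if not intron.startswith("GT"):
            problems.append(f"{label} does not start with GT")
        if not intron.endswith("AG"):
            problems.append(f"{label} does not end with AG")

    # Assumed rules for the spliced coding sequence.
    cds = "".join(seq[start - 1:end] for start, end in exons)
    if len(cds) % 3 != 0:
        problems.append("coding length is not a multiple of 3")
    if not cds.startswith("ATG"):
        problems.append("coding sequence does not start with ATG")
    if cds[-3:] not in STOP_CODONS:
        problems.append("coding sequence does not end with a stop codon")
    for i in range(0, len(cds) - 3, 3):
        if cds[i:i + 3] in STOP_CODONS:
            problems.append(f"in-frame stop codon at coding position {i + 1}")

    return problems


# Toy gene: exon ATGGCC, intron GTAAAG, exon GGTTAA.
toy_sequence = "ATGGCC" + "GTAAAG" + "GGTTAA"
toy_exons = [(1, 6), (13, 18)]
print(check_gene_structure(toy_sequence, toy_exons))
```

A small fraction of genuine introns do not follow the GT/AG pattern, so in a sketch like this a reported problem is a flag for manual inspection, not proof that the entry is wrong.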
Similar resources
Differential Expression of Arabidopsis thaliana Acid Phosphatases in Response to Abiotic Stresses
The objective of this research is to identify Arabidopsis thaliana genes encoding acid phosphatases induced by phosphate starvation. Multiple alignments of eukaryotic acid phosphatase amino acid sequences led to the classification of these proteins into four groups including purple acid phosphatases (PAPs). Specific degenerate primers were designed based on conserved sequences of PAPs isol...
Yeast Two Hybrid cDNA Screening of Arabidopsis thaliana for SETH4 Protein Interaction
The SETH4 coding sequence, 2013 bp long, is a member of a gene family expressed in gametophytic tissues of Arabidopsis thaliana. This fragment was PCR amplified using Kod Hi Fi DNA polymerase. The fragment was cloned into the pGBKT7 bait vector, and transformed E. coli DH5α cells containing the vector were selected on LB medium containing kanamycin. Finally, the pGBKT7-SETH4 bait was transformed into yeast ...
All about Arabidopsis
Arabidopsis thaliana has grown from a lowly weed into a model organism of lofty stature and spawned a rapidly growing field of research. A wealth of on-line resources has arisen to coordinate the multicenter Arabidopsis sequencing project. The Arabidopsis Genome Initiative (AGI) provides up-to-date information about ongoing efforts to sequence the five chromosomes of A. thaliana. As of March 20...
DAtA: Database of Arabidopsis thaliana Annotation
The Database of Arabidopsis thaliana Annotation (DAtA) was created to enable easy access to and analysis of all the Arabidopsis genome project annotation. The database was constructed using the completed A.thaliana genomic sequence data currently in GenBank. An automated annotation process was used to predict coding sequences for GenBank records that do not include annotation. DAtA also con...
Negative control of Strictisidine synthase like-7 gene on salt stress resistance in Arabidopsis thaliana
Strictosidine synthase-like (SSL) is a group of gene families in the Arabidopsis genome whose orthologues in other plants are key enzymes in the monoterpenoid indole alkaloid biosynthesis pathway. SSL7 is upregulated upon treatment of Arabidopsis plants with signaling molecules such as SA, methyl jasmonate and ethylene. To find the functional role of the gene, a T-DNA-mediated knockout...
Journal title:
- Nucleic acids research
Volume 24, Issue 2
Pages -
Publication date 1996